Titanic Classification Project

By: Victoria Engler

Description: I recently decided to pursue Data Science full time. Coming from a background in Data Analytics and Python programming, I remembered trying this project in one of my first DS courses.

I couldn't remember whether we ever actually completed it end to end, so I used it here to re-teach myself some of the DS fundamentals and strengthen my ever-growing skill set.

For those who are unaware, the Titanic dataset challenge is a popular beginner's project: the goal is to build a classifier that predicts whether an individual survived the wreck. Working through various models, I tried to decide whether each one was overfitted or underfitted, and whether it was even the right model to use. This project taught me that there's no set answer in DS, and that there are so many different methods and resources out there to learn from that will only help me ask tougher questions of the data and be as accurate as possible. In the end, I am still left with questions and plan to keep tuning the models below to help me decide the best path to success in any challenging scenario.

Loading in all the packages used

Loading data

Appending two csvs together to do some EDA
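The write-up doesn't show the file names, so this is a minimal sketch of the append step assuming the usual Kaggle `train.csv` / `test.csv` pair; tiny stand-in frames are used here so the snippet runs on its own.

```python
import pandas as pd

# Stand-ins for the two CSVs (in the real project these would come from
# pd.read_csv("train.csv") and pd.read_csv("test.csv")). Column names are
# the actual Titanic columns; the rows are illustrative only.
train = pd.DataFrame({"PassengerId": [1, 2], "Pclass": [3, 1],
                      "Sex": ["male", "female"], "Fare": [7.25, 71.28]})
test = pd.DataFrame({"PassengerId": [3], "Pclass": [3],
                     "Sex": ["female"], "Fare": [7.92]})

# Stack the two frames for EDA; ignore_index renumbers the rows 0..n-1.
combined = pd.concat([train, test], ignore_index=True)
```

Concatenating before EDA is handy for summary statistics, but the test rows must be split back out before any model fitting to avoid leakage.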

Setting up my data for success

It's always interesting to observe the way the world was back then and compare it to now. I was interested below to see what the fare differences were by both class and sex.
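A fare comparison by class and sex reduces to a single groupby. A sketch, assuming the combined frame has the standard `Pclass`, `Sex`, and `Fare` columns (the rows below are illustrative, not real Titanic data):

```python
import pandas as pd

# Small illustrative frame with the Titanic column names.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Sex": ["female", "male", "female", "male", "female", "male"],
    "Fare": [120.0, 80.0, 26.0, 13.0, 15.5, 7.25],
})

# Mean fare broken out by class and sex, pivoted so the sexes sit side by side.
fare_by_class_sex = df.groupby(["Pclass", "Sex"])["Fare"].mean().unstack()
print(fare_by_class_sex)
```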

It's also interesting that females had a higher average fare; it makes me wonder whether that was due to the smaller number of women and children.

I'm also curious about the number of kids and how their fares varied by sex. The number of kids in each class was pretty consistent.
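The child analysis is a filter plus the same groupby pattern. A sketch assuming an under-16 age cutoff for "kids" (the cutoff and the rows below are assumptions, not from the write-up):

```python
import pandas as pd

# Illustrative rows; in the project the frame comes from the concatenated CSVs.
df = pd.DataFrame({
    "Age": [4, 9, 7, 30, 6, 12],
    "Sex": ["male", "female", "male", "male", "female", "male"],
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [39.0, 31.0, 26.0, 8.05, 12.5, 15.9],
})

# Treating passengers under 16 as children (assumed cutoff).
kids = df[df["Age"] < 16]
kid_counts = kids.groupby(["Pclass", "Sex"]).size()       # how many per class/sex
kid_fares = kids.groupby(["Pclass", "Sex"])["Fare"].mean()  # mean fare per class/sex
print(kid_counts, kid_fares, sep="\n")
```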

The fare for boys was slightly higher in all classes.

There were significantly more men than women in every class but one.

In first class, women's fares were surprisingly higher than men's.

Now to the data transformation and classification.

Creating the preprocessor and pipeline for the initial analysis
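The write-up doesn't show which columns went into the preprocessor, so this is a sketch of a common setup for the Titanic columns (median-impute and scale the numerics, one-hot encode the categoricals), not necessarily the author's exact choices:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Typical feature split for the Kaggle Titanic columns (assumed).
numeric_cols = ["Age", "Fare"]
categorical_cols = ["Pclass", "Sex", "Embarked"]

preprocessor = ColumnTransformer([
    # Impute missing ages/fares with the median, then standardize.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # One-hot encode categoricals; ignore categories unseen at fit time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# The same preprocessor can front any of the candidate models.
pipe = Pipeline([("prep", preprocessor),
                 ("model", LogisticRegression(max_iter=1000))])
```

Bundling preprocessing and model in one `Pipeline` means cross-validation refits the imputer and scaler on each training fold, which avoids leaking test-fold statistics.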

Defining four different models, just to take a look at how they do and see if I can adjust any of the hyperparameters
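The write-up names only DecisionTreeClassifier and LogisticRegression; RandomForestClassifier and KNeighborsClassifier stand in for the other two here, and synthetic data stands in for the preprocessed features. A sketch of the first-glance comparison:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),  # assumed
    "KNeighborsClassifier": KNeighborsClassifier(),                    # assumed
}

# 5-fold cross-validated accuracy gives a first-glance score per candidate.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
```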

Clearly, DecisionTreeClassifier and LogisticRegression performed the best at first glance; now to dig deeper below.

Seeing that the DecisionTreeClassifier got a 1.0 is a clear sign to me that it is overfitted. However, with an AUC (area under the curve) score of .95, I feel a bit more confident that its ability to properly predict True and False might be okay.
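The overfitting check described here amounts to comparing train vs. test accuracy and scoring AUC on held-out data. A sketch on synthetic stand-in data (an unconstrained tree memorizes its training set, which reproduces the suspicious 1.0):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# An unconstrained tree fits the training data perfectly.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)   # 1.0 on train is the overfitting red flag
test_acc = tree.score(X_te, y_te)    # the honest number is the held-out score

# AUC is computed from the predicted probability of the positive class.
auc = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
print(f"train={train_acc:.3f} test={test_acc:.3f} auc={auc:.3f}")
```

Capping the tree (e.g. `max_depth`) and re-checking the train/test gap is the usual next tuning step.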

There's a bit more variability with the Logistic Regression classifier. I will be looking more deeply at these results to truly interpret the best route; however, I feel confident in Logistic Regression due to its consistency and accuracy score.
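"Consistency" can be quantified as the fold-to-fold spread of cross-validation scores. A sketch on synthetic stand-in data; a small standard deviation across folds supports the consistency argument:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Ten folds: the mean is the headline accuracy, the std is the variability.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```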

Next steps: